Hadoop is an open-source framework designed to store and process large datasets across clusters of computers using simple programming models. It is highly scalable, allowing for the expansion from a single server to thousands of machines, each offering local computation and storage.
At the heart of Hadoop are two main components: the Hadoop Distributed File System (HDFS) and MapReduce. HDFS is responsible for storing data across the cluster efficiently, while MapReduce handles the processing of this data. MapReduce divides tasks into small parts, assigning them to multiple nodes in a cluster, significantly speeding up processing times.
This framework excels in handling vast amounts of data, providing a reliable, fault-tolerant software architecture. Hadoop has become a key technology in the field of big data analytics, supporting a wide range of applications and services.
This notebook covers the tasks given for the project. Since the operating system is macOS, some limitations apply and are noted where relevant.
Happy Reading !!
Ganesh
For this part, since we are on macOS, we will use a UTM virtual machine to install Hadoop.
Integrating Cloudera into macOS Using UTM Virtualization: An Efficient Methodology
Integrating Cloudera on a MacBook with Apple Silicon can be effectively achieved using UTM for virtualization. UTM enables the creation and management of virtual machines (VMs) on macOS, allowing for the deployment of complex data environments such as Cloudera without requiring a traditional dual-boot setup.
The process begins with the installation of UTM, which is available either for free from its official website or through the Mac App Store. The latter option provides automatic updates and dedicated support. Once UTM is installed, the next step involves preparing for the Cloudera installation.
Instead of installing a full Linux operating system, one can utilize UTM’s capability to import pre-configured virtual machine environments directly from its library. This is achieved by navigating to the UTM Gallery, where various VMs are available for download. It is crucial to select a VM that is compatible with Cloudera, ensuring it meets the necessary specifications such as sufficient CPU allocation, RAM, and storage, and is ARM-compatible, aligning with the architecture of Apple Silicon.
After the appropriate VM is imported into UTM, the Cloudera package can then be downloaded and installed within this virtual environment. This approach not only facilitates the use of Cloudera’s powerful data management tools on a MacBook but also leverages the advanced hardware capabilities of Apple Silicon, combined with the versatility of UTM’s virtualization features. By virtualizing a compatible environment via UTM, the integration of robust data management functionalities into the macOS ecosystem is seamlessly accomplished.
Next, we check the version of Hadoop that was installed.
```shell
# Check the installed Hadoop version
hadoop version
```
```shell
# Starting Hadoop and resource checks
jps
sudo jps
```
We can see from the jps output that all the Hadoop daemons are running successfully for the version we installed.
Mauritius, Mauritius, a jewel in the ocean, Mauritius, where the waters of the Indian Ocean lap gently at golden sands. Mauritius, an island of unparalleled beauty, where each beach tells a story, a story of serene mornings and fiery sunsets, of tranquil bays and lively shores. In Mauritius, the beaches are not just stretches of sand; they are canvases of nature's artistry, where the turquoise sea meets the azure sky in a horizon that stretches beyond imagination.
But Mauritius, oh Mauritius, is more than its beaches. The roads of Mauritius, winding and weaving through lush landscapes, tell tales of adventure and discovery. These roads, they guide you from the quiet coastal villages to the bustling heart of the cities, each turn a promise of new sights and sounds. In Mauritius, the roads are like ribbons tying the island's diverse experiences together, from the tranquil to the vibrant, from the natural to the crafted.
And the people, the people of Mauritius, are the soul of the island. Warm smiles, open hearts, a community where every visitor feels like they belong. In Mauritius, hospitality isn't just an act; it's a way of life. Here, in Mauritius, every handshake is friendly, every greeting sincere. The people of Mauritius, with their rich tapestry of cultures, make the island not just a place to visit, but a place to love, a place to return to.
In Mauritius, the rhythm of life is a melody composed of the crashing waves, the whispering winds, and the laughter of its people. Life in Mauritius is a dance, a dance that invites everyone to join, under the sun and stars, along the beaches, down the roads, and in the hearts of every home.
Mauritius, Mauritius, where every day is a celebration of nature's bounty and human kindness. Mauritius, where the beauty of the land shapes the spirit of its people, resilient and welcoming, proud yet humble. In Mauritius, you don't just see the sights; you feel them, you become part of them. Mauritius, where the beaches are your sanctuary, the roads your journey, and the people your family.
To visit Mauritius is to experience a world where every moment is a treasure, every encounter a memory. Mauritius, an island not just of scenic vistas but of deep, heartfelt connections. Here, in Mauritius, you don't say goodbye; you say, "Until next time," because Mauritius, with its beaches, its roads, and its people, stays with you, long after you've left its shores.
Mauritius, Mauritius, a chorus repeated, a refrain that echoes in the soul of those who've walked its beaches, traversed its roads, and met its people. Mauritius, a call, a promise, a home away from home.
Adding External JARs option on the Right Hand Side
```java
// Importing libraries
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WCMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

    // Map function: emit (word, 1) for every word in the line
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter rep)
            throws IOException {
        String line = value.toString();
        // Splitting the line on spaces
        for (String word : line.split(" ")) {
            if (word.length() > 0) {
                output.collect(new Text(word), new IntWritable(1));
            }
        }
    }
}
```
```java
// Importing libraries
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WCReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

    // Reduce function: sum the counts emitted for each word
    public void reduce(Text key, Iterator<IntWritable> value, OutputCollector<Text, IntWritable> output, Reporter rep)
            throws IOException {
        int count = 0;
        // Counting the frequency of each word
        while (value.hasNext()) {
            IntWritable i = value.next();
            count += i.get();
        }
        output.collect(key, new IntWritable(count));
    }
}
```
```java
// Importing libraries
import java.io.IOException;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WCDriver extends Configured implements Tool {

    public int run(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("Please give valid inputs");
            return -1;
        }
        JobConf conf = new JobConf(WCDriver.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setMapperClass(WCMapper.class);
        conf.setReducerClass(WCReducer.class);
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(IntWritable.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        JobClient.runJob(conf);
        return 0;
    }

    // Main method
    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new WCDriver(), args);
        System.out.println(exitCode);
    }
}
```
```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
External archive libraries were added:
- Hadoop Common JAR
- Hadoop Core JAR
Now, let's put our text file with the Mauritius poem into the Hadoop environment.
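A minimal sketch of this step, assuming the poem is saved locally as mauritius.txt, the HDFS daemons are running, and /user/ganesh is the user directory (file name and paths are placeholders):

```shell
# Create an input directory in HDFS (path is an assumed placeholder)
hdfs dfs -mkdir -p /user/ganesh/input

# Copy the local poem file into HDFS
hdfs dfs -put mauritius.txt /user/ganesh/input/

# Verify the file landed where expected
hdfs dfs -ls /user/ganesh/input
```

The same copy can also be done with `hdfs dfs -copyFromLocal`; `-put` is the more common shorthand.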
Renaming and saving the JAR file
When refining a MapReduce word-count job like the one above, several areas are worth addressing:
- Text normalization: convert all text to lower case and strip punctuation so that 'Mauritius' and 'Mauritius,' are counted as the same word; otherwise case sensitivity and punctuation skew the counts.
- Combiner: incorporate a combiner function to aggregate word counts on the map side, significantly reducing the volume of data shuffled across the network.
- Load balancing: ensure keys are distributed evenly across reducers to prevent bottlenecks, and use incremental (in-mapper) aggregation where possible.
- Resource tuning: allocate sufficient memory and CPU to tasks based on the workload to enhance performance.
- Scalability: the job should maintain efficiency as data volumes grow, which calls for regular testing and adjustment.
- Error handling: build in robust handling of data inconsistencies and unexpected operational issues, so the job stays resilient and accurate.
Addressing these aspects makes the word count quicker and more reliable.
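As a minimal sketch of the normalization and in-mapper aggregation points (plain ASCII text is assumed, and the function name is illustrative, not part of the job above):

```python
import re
from collections import Counter

def normalized_counts(lines):
    """Count words after lowercasing and stripping punctuation."""
    counts = Counter()
    for line in lines:
        # Keep only alphabetic runs so "Mauritius," and "mauritius" collapse together;
        # updating a local Counter per mapper acts like in-mapper aggregation
        counts.update(re.findall(r"[a-z]+", line.lower()))
    return counts

# e.g. "Mauritius," and "Mauritius!" and "mauritius" all count as the same word
print(normalized_counts(["Mauritius, Mauritius!", "mauritius"]))
```

In the Java job, the same normalization would live in the mapper before `context.write`, with `job.setCombinerClass` handling the map-side aggregation.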
Now, let's run through the same process as above and see the results.
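Concretely, the run might look like the sketch below; the JAR name and the HDFS paths are assumed placeholders, and the driver class matches the WCDriver defined earlier:

```shell
# Submit the word-count job (WC.jar and the paths are assumed placeholders)
hadoop jar WC.jar WCDriver /user/ganesh/input /user/ganesh/output

# Inspect the word counts written by the reducer
hdfs dfs -cat /user/ganesh/output/part-00000
```

Note that the output directory must not already exist; HDFS refuses to overwrite it, so remove it with `hdfs dfs -rm -r` before re-running.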
Create (write) a MapReduce program that joins students and grades data from the attached files and outputs the student names and their grades
```python
# MapReduce program to merge files using Python (MapReduce framework simulation)

# Defining paths to the files
students_file_path = 'students.txt'
grades_file_path = 'grades.txt'

# Reading the files
with open(students_file_path, 'r') as file:
    students_data = file.readlines()
with open(grades_file_path, 'r') as file:
    grades_data = file.readlines()

# Mapper function
def mapper(records):
    # Mapping each record to (student_id, value) pairs
    return [(record.split()[0], ' '.join(record.split()[1:])) for record in records]

# Map phase
mapped_students = mapper(students_data)
mapped_grades = mapper(grades_data)

# Shuffle and Sort phase (simulated)
# Concatenate mapped data and sort by student ID (key); the sort is stable,
# so a student's name always precedes their grade within a key group
all_mapped_data = sorted(mapped_students + mapped_grades, key=lambda x: x[0])

# Reducer function
def reducer(sorted_mapped_data):
    reduced_data = {}
    # Combine name and grade for the same student ID
    for student_id, value in sorted_mapped_data:
        if student_id not in reduced_data:
            reduced_data[student_id] = []
        reduced_data[student_id].append(value)
    # Generate final output
    results = []
    for student_id, values in reduced_data.items():
        if len(values) == 2:  # Assuming each ID has exactly one name and one grade
            results.append(f"{values[0]}, {values[1]}")
    return results

# Reduce phase
final_output = reducer(all_mapped_data)

# Displaying the results
for item in final_output:
    print(item)
```
Gagandeep Singh, 85
Jaspreet Kaur, 65
Samantha Smith, 90
Paolo Souza, 80
Shishandeep Kaur, 75
John Martin, 54
Arshdeep Kaur, 70
Ingrid Weber, 70
Manpreet Singh, 55
Amandeep Kaur, 65
- File Reading: The script starts by defining paths to two files, students.txt and grades.txt. Each file is opened, and its contents are read into the lists students_data and grades_data.
- Mapper Function: The mapper function takes a list of records (lines from either file) and processes each line to extract the student ID (assumed to be the first space-separated element) and the remainder of the line.
- Map Phase: mapped_students and mapped_grades hold the mapped data from each respective file.
- Shuffle and Sort Phase: The mapped data is concatenated into all_mapped_data and sorted by student ID. This sorting simulates the Shuffle and Sort phase of MapReduce, where data is shuffled so that all records concerning the same key (student ID) are brought together.
- Reduce Function: The reducer function processes the sorted list, consolidating entries with the same student ID by appending them to a list under the corresponding key in the dictionary reduced_data.
- Reduce Phase: Running the reducer on the sorted, consolidated data produces the final list final_output of formatted strings that combine student names and grades.
- Output: Iterating over final_output displays the merged results of student names with their corresponding grades.

This script is particularly useful in educational data management, where combining records from different sources by student ID is necessary; for instance, merging exam grades with student registries for report generation. The simulation provides a basic understanding of how large-scale data processing frameworks like Hadoop implement MapReduce to handle vast amounts of data efficiently. The example could be expanded to handle more complex data structures and larger datasets, demonstrating the scalability and robustness of the MapReduce model.
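As one possible extension, the shuffle-and-reduce simulation can be written with itertools.groupby, which also tolerates IDs that are missing a grade. The sample records and the "N/A" fallback below are hypothetical illustrations, not data from the attached files:

```python
from itertools import groupby

# Hypothetical sample records in the same "(student_id, value)" shape the mapper emits
students = [("1", "Gagandeep Singh"), ("2", "Jaspreet Kaur"), ("3", "Samantha Smith")]
grades = [("1", "85"), ("3", "90")]  # ID 2 has no grade on purpose

def join_with_groupby(mapped):
    """Simulate shuffle/sort, then reduce each key group with groupby."""
    results = []
    # Stable sort keeps the name (from the students list) ahead of the grade
    for student_id, group in groupby(sorted(mapped, key=lambda kv: kv[0]),
                                     key=lambda kv: kv[0]):
        values = [v for _, v in group]
        name = values[0]
        grade = values[1] if len(values) > 1 else "N/A"  # fall back when no grade exists
        results.append(f"{name}, {grade}")
    return results

print(join_with_groupby(students + grades))
```

Compared to the dictionary-based reducer above, this version emits a row even for unmatched IDs, which is closer to a left outer join.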